The problem explored in this project is the identification of a given user based on their typing patterns. Here, we use 4 datasets, found in the data folder under the names 'keystroke0' (the CMU dataset), 'keystroke1' (the standard dataset), 'keystroke2' (the multi-password dataset), and 'keystroke3' (the demographics dataset). The latter 3 come from the GREYC lab.
In this project, we will run an SVM classifier in several configurations. First, we will use an n-class classifier, where n is the total number of users in the dataset, and try to determine precisely which of the n users each test sample came from.
Next, we will run a two-class classifier to decide, with a simple binary answer, whether the input came from the genuine user or from one of the impostors. This is a simpler problem because it does not need to be as specific, but it is more practical for real-world applications: keyboard biometrics are chiefly useful for verifying that the person typing the password is in fact the claimed user.
Finally, we want to measure not only how often the classifier answers correctly for the genuine user and the impostors, but also to focus on the false positives (when an impostor gets in) and the false negatives (when the genuine user is erroneously blocked from the system). For this portion, we will construct ROC curves to visually present the true and false positive rates of the detections.
Support vector machines (SVMs) are supervised learning models. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm encodes every input as a p-dimensional vector and computes a (p-1)-dimensional hyperplane that separates these points into two groups. A new data point is then encoded as a similar p-dimensional vector, falls on one side of the hyperplane or the other, and is classified accordingly.
The SVM creates a margin separating the data, defined by the distance from the hyperplane to the nearest point on either side. If the data is linearly separable, it is possible to create a "hard margin", meaning every training point is classified correctly and no data point falls inside the margin at all. This makes a very good classifier, but the data is not always this nice. More commonly, a "soft margin" is used, which places a penalty on points that fall within the margin (or on the wrong side of it) and minimizes this penalty to determine the best hyperplane.
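For concreteness, the soft-margin version can be written as the following standard optimization problem, where C is the penalty factor on the slack variables measuring how far each point lies inside (or beyond) the margin:

$$\min_{w,\,b,\,\xi}\;\;\frac{1}{2}\lVert w\rVert^{2} \;+\; C\sum_{i=1}^{m}\xi_{i}
\qquad\text{subject to}\qquad
y_{i}\,(w\cdot x_{i} + b)\;\ge\;1-\xi_{i},\quad \xi_{i}\ge 0.$$

Larger C punishes margin violations more heavily (approaching the hard-margin case), while smaller C tolerates more violations in exchange for a wider margin.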
The shape of the decision boundary is determined by the "kernel" function used. The kernel can be linear, polynomial, or a Gaussian radial basis function (RBF). A non-linear kernel implicitly maps the data into a higher-dimensional space, so the separating hyperplane there corresponds to a curved boundary in the original space, which often allows for better classification depending on the type of data and the kernel's parameters.
For multi-class classification, we use the standard two-class SVM but take a one-vs-all or a one-vs-one approach to determine which single class a given test point fits best. In one-vs-all, n classifiers are trained, each comparing class x against all other classes; the test point is run through each classifier and the class with the highest score of belonging to the class rather than to the rest wins. In one-vs-one, a classifier is trained for every pair of classes and the class that wins the most pairwise votes is chosen.
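As a minimal sketch (using scikit-learn, which the rest of this notebook relies on), both strategies can be made explicit with the wrappers below; note that scikit-learn's SVC itself uses a one-vs-one scheme internally when given more than two classes.

In [ ]:
# Sketch only: explicit one-vs-rest and one-vs-one multi-class wrappers around an SVC.
# X_train, y_train and X_test are assumed to be defined as in the cells further down.
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

ovr = OneVsRestClassifier(svm.SVC(C=1)).fit(X_train, y_train)  # n binary classifiers
ovo = OneVsOneClassifier(svm.SVC(C=1)).fit(X_train, y_train)   # n*(n-1)/2 pairwise classifiers
print(ovr.predict(X_test[:5]))
print(ovo.predict(X_test[:5]))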
SVMs have proven effective for text categorization and bioinformatics problems and are known to perform well on large datasets. They work best on (nearly) linearly separable data, though it is not certain whether the datasets examined in this project have this property.
In [1]:
%matplotlib inline
import csv
import numpy as np
from scipy import interp
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.cross_validation import StratifiedKFold
from sklearn import preprocessing
import seaborn as sns
In [2]:
def score(clf, X_test, y_test):
    idx = 0
    score = 0
    for x in X_test:
        y_pred = clf.predict([x])
        if y_pred == y_test[idx]:
            score += 1
        idx += 1
    return 100*score/idx
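For reference, this hand-rolled accuracy is the same (up to the factor of 100) as the mean accuracy that scikit-learn classifiers expose directly, so either can serve as a sanity check:

In [ ]:
# Equivalent accuracy via the built-in score method of any scikit-learn classifier.
# clf, X_test and y_test are assumed to be defined as in the cells below.
print(100 * clf.score(X_test, y_test))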
In [3]:
def score_complex(clf, X_test, Y_test):
    count = 0
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    user = Y_test.count(1)
    impostor = Y_test.count(0)
    for x in X_test:
        y_hat = clf.predict([x])
        if Y_test[count] == 1 and y_hat == 1:
            tp += 1
        if Y_test[count] == 1 and y_hat == 0:
            fn += 1
        if Y_test[count] == 0 and y_hat == 1:
            fp += 1
        if Y_test[count] == 0 and y_hat == 0:
            tn += 1
        count += 1
    return (round(tp/user*100, 2), round(tn/impostor*100, 2), round(fp/impostor*100, 2), round(fn/user*100, 2))
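As a cross-check, the same four rates can be read off scikit-learn's confusion matrix instead of counting by hand; a minimal sketch, assuming (as above) that 1 marks the genuine user and 0 an impostor:

In [ ]:
# Sketch: TP/TN/FP/FN rates derived from sklearn's confusion matrix.
from sklearn.metrics import confusion_matrix

def score_complex_cm(clf, X_test, Y_test):
    y_hat = clf.predict(X_test)
    # rows are true labels, columns are predictions, in the order [0, 1]
    tn, fp, fn, tp = confusion_matrix(Y_test, y_hat, labels=[0, 1]).ravel()
    user = tp + fn       # genuine samples in the test set
    impostor = tn + fp   # impostor samples in the test set
    return (round(tp/user*100, 2), round(tn/impostor*100, 2),
            round(fp/impostor*100, 2), round(fn/user*100, 2))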
In [38]:
filename = "../data/keystroke0.csv"
In [39]:
X = []
Y = []
with open(filename, 'r') as f:
    reader = csv.reader(f)
    first = 1
    for row in reader:
        if first:
            # skip the header row
            first = 0
            continue
        # timing features start at column 3; subject ids look like 's002'
        X.append([float(v) for v in row[3:]])
        Y.append(int(row[0][1:]))
In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = svm.SVC(C = 1)
clf.fit(X_train, y_train)
Out[31]:
In [7]:
print(str(score(clf, X_test, y_test)) + '%')
Now we normalize the data to mean 0 and variance 1, which gives a significantly better fit. We will normalize all datasets in this way from here on.
In [40]:
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = svm.SVC(C = 1)
clf.fit(X_train, y_train)
print(str(score(clf, X_test, y_test)) + '%')
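One refinement worth noting: scaling the full matrix before splitting lets statistics from the test rows leak into training. A minimal sketch of the stricter version, assuming X_train and X_test here are the raw (unscaled) splits, with the scaler fit on the training split only:

In [ ]:
# Sketch: fit the scaler on the training split only, then apply it to both splits.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
clf_s = svm.SVC(C=1).fit(X_train_s, y_train)
print(str(score(clf_s, X_test_s, y_test)) + '%')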
In [9]:
# Test different C values
results = []
Cs = [1, 25, 50, 75, 100]
for i in Cs:
    results.append(score(svm.SVC(C=i).fit(X_train, y_train), X_test, y_test))
print(Cs)
print(results)
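The same comparison can be run more systematically as a cross-validated grid search over C and the kernel; a minimal sketch (GridSearchCV lives in sklearn.grid_search in the older releases this notebook's imports suggest, and in sklearn.model_selection in newer ones):

In [ ]:
# Sketch: cross-validated search over C and kernel instead of a single train/test split.
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

param_grid = {'C': [1, 25, 50, 75, 100], 'kernel': ['linear', 'rbf', 'poly', 'sigmoid']}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)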
In [10]:
# Test different kernels
for i in ['linear', 'rbf', 'poly', 'sigmoid']:
    print(i + ': ' + str(score(svm.SVC(kernel=i).fit(X_train, y_train), X_test, y_test)) + '%')
RBF is clearly the best kernel in this case (which is also the default), so we will use it for all remaining calculations on this data set.
Clearly, SVM is fairly accurate at identifying the 57 users in the database based on their keyboard biometrics. However, much more important than this metric is the algorithm's ability to distinguish one specified user from the other 56, as would be the case if the user wanted to use this biometric as a means of authenticating his/her identity before logging into a computer, phone, etc.
In [11]:
genuines = [2, 5, 13, 39]
ys = []
for g in genuines:
    # Genuine = 1, Impostor = 0
    newY = []
    for elt in Y:
        if elt == g:
            newY.append(1)
        else:
            newY.append(0)
    ys.append(newY)
In [12]:
# Make sure we have exactly 400 1s in each list
for i in range(len(genuines)):
    print(sum(ys[i]))
In [13]:
scores = []
for y in ys:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = svm.SVC(C=1)
    clf.fit(X_train, y_train)
    scores.append(str(score(clf, X_test, y_test)) + '%')
print(scores)
At first glance this looks odd: the four genuine users get almost identical scores. On reflection it is less surprising. They are not exactly the same, just very close, and the classes are heavily imbalanced (only 400 genuine samples per user against everyone else's), so a classifier that predicts "impostor" nearly everywhere is already correct the vast majority of the time. The headline accuracy therefore says little about how well the genuine user is actually recognized, which motivates the per-class rates below.
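To make the imbalance concrete, here is a hedged sketch of the majority-class baseline (always predicting "impostor") alongside an SVC that rescales the penalty C by class frequency via class_weight:

In [ ]:
# Sketch: majority-class baseline and a class-weighted SVC for the imbalanced setting.
y = ys[0]
baseline = 100 * y.count(0) / len(y)  # accuracy of always predicting 'impostor'
print('always-impostor baseline: ' + str(round(baseline, 2)) + '%')

Xb_train, Xb_test, yb_train, yb_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 'balanced' was called 'auto' in very old scikit-learn releases
clf_w = svm.SVC(C=1, class_weight='balanced').fit(Xb_train, yb_train)
print(str(score(clf_w, Xb_test, yb_test)) + '%')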
In [41]:
true_positives = []
false_negatives = []
true_negatives = []
false_positives = []
count = 0
for i in [x for x in range(2, 58) if x in Y]:
    count += 1
    user = [1 if x == i else 0 for x in Y]
    X_train, X_test, Y_train, Y_test = train_test_split(X, user, test_size=0.2, random_state=42)
    clf = svm.SVC()
    clf.fit(X_train, Y_train)
    scores = score_complex(clf, X_test, Y_test)
    true_positives.append(scores[0])
    false_negatives.append(scores[3])
    true_negatives.append(scores[1])
    false_positives.append(scores[2])
    # print('SUBJECT' + str(i) + ': \n True-Positive = ' + str(scores[0]) + '% \n False-Negative = ' + str(scores[3]) + '% \n True-Negative = ' + str(scores[1]) + '% \n False-Positive = ' + str(scores[2]) + '%')
print(count)
In [54]:
plt.figure()
sns.boxplot(true_positives,color = 'turquoise')
plt.title('True-Positive Rates for All Users')
plt.xlabel('TPR')
plt.show()
plt.figure()
sns.boxplot(false_positives,color = 'r')
plt.title('False-Positive Rates for All Users')
plt.xlabel('FPR')
plt.show()
Using SVM to validate the identity of each of the users against the others generates some interesting results. The TPR (i.e. the percentage of the time the true user is attempting to log in and is identified as such) takes on quite a large range of values between different users, with a mean of ~78%. This means that roughly 1 in 5 times the true user will be rejected from his/her own account, which is less than ideal. The complementary metric, FPR (i.e. the percentage of the time an impostor is attempting to log in as someone else and is incorrectly identified as the true user), takes on a much narrower range of values and, thankfully, has a much lower mean of ~5%, meaning that only about 1 in 20 attempts by an impostor will get into someone else's account; the distribution is also left-skewed. This is still not a terrific result, but it is not bad considering the naivety of the approach.
Below is the basic code for creating some ROC curves. In this test we simply make 6 different cross-validation folds and compare them against each other. Ideally we would use the same machinery to compare different C values, kernels, and datasets, as sketched after the plot.
In [16]:
# Use the first genuine-user labelling; the choice of user makes little difference here
yyy = np.array(ys[0])
XXX = np.array(X)
# Stratified 6-fold cross-validation
cv = StratifiedKFold(yyy, n_folds=6)
classifier = svm.SVC(kernel='linear', probability=True)
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
all_tpr = []
for i, (train, test) in enumerate(cv):
    probas_ = classifier.fit(XXX[train], yyy[train]).predict_proba(XXX[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(yyy[test], probas_[:, 1])
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')
mean_tpr /= len(cv)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, 'k--',
label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
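As noted above, the more useful comparison is across hyperparameters rather than across folds. A minimal sketch along those lines, reusing the same XXX/yyy arrays and stratified folds to compare kernels by mean AUC:

In [ ]:
# Sketch: compare kernels by mean AUC over the same stratified 6-fold split.
for kern in ['linear', 'rbf', 'poly', 'sigmoid']:
    aucs = []
    for train, test in StratifiedKFold(yyy, n_folds=6):
        probas_ = svm.SVC(kernel=kern, probability=True).fit(XXX[train], yyy[train]).predict_proba(XXX[test])
        fpr, tpr, _ = roc_curve(yyy[test], probas_[:, 1])
        aucs.append(auc(fpr, tpr))
    print(kern + ': mean AUC = ' + str(round(np.mean(aucs), 3)))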
In [17]:
filename = "../data/keystroke1.csv"
In [18]:
X = []
y = []
with open(filename, 'r') as f:
    reader = csv.reader(f)
    first = 1
    for row in reader:
        if first:
            first = 0
            continue
        rw = list(map(lambda x: x/10000000., map(int, row[5].split(' ')[1:61])))
        if len(rw) < 60:
            continue
        X.append(rw)
        y.append(int(row[0]))
X = preprocessing.scale(X)
In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = svm.SVC(C = 1)
clf.fit(X_train, y_train)
Out[19]:
In [20]:
print(str(score(clf, X_test, y_test)) + '%')
This is the GREYC dataset where users are also classified by demographics. It contains 110 users, and each has a binary class assigned based on age (younger or older than 30), gender, and handedness.
The data is split into 2 files, 'keystroke3-1.csv' and 'keystroke3-2.csv', each containing the typing of a different password.
In [21]:
filename = "../data/keystroke3-1.csv"
In [22]:
X = []
Y = []
gender = []
age = []
handedness = []
with open(filename, 'r') as f:
    reader = csv.reader(f)
    first = 1
    for row in reader:
        if first:
            first = 0
            continue
        rw = list(map(lambda x: x/10000000., map(int, row[6].split(' ')[1:61])))
        if len(rw) < 5:
            print(rw)
        X.append(rw)
        Y.append(int(row[0]))
        gender.append(row[1])
        age.append(int(row[2]))
        handedness.append(row[3])
# normalize with mean 0 and variance 1
X = preprocessing.scale(X)
# 1 = female, 0 = male
gender = [0 if x == 'M' else 1 for x in gender]
# 1 = right, 0 = left
handedness = [0 if x == 'L' else 1 for x in handedness]
Data set demographics:
In [23]:
print('Right-Handed: ' + str(np.round(handedness.count(0)/len(handedness),2)*100) + '%')
print('Male: ' + str(np.round(gender.count(0)/len(gender),2)*100) + '%')
plt.figure()
plt.hist(age,bins = 15)
plt.title('Age of Subjects')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
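The demographic labels loaded above are not used further below, but since each is a binary class, the same two-class SVM machinery applies directly. A purely illustrative sketch, using gender as the target (no claims here about how well this actually works on this data):

In [ ]:
# Sketch: predict a binary demographic label (gender) from the same timing features.
Xg_train, Xg_test, g_train, g_test = train_test_split(X, gender, test_size=0.2, random_state=42)
clf_g = svm.SVC().fit(Xg_train, g_train)
print('gender: ' + str(score(clf_g, Xg_test, g_test)) + '%')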
In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = svm.SVC(C = 1)
clf.fit(X_train, y_train)
print(str(score(clf, X_test, y_test)) + '%')
In [25]:
# test different kernels
for i in ['linear', 'rbf', 'poly', 'sigmoid']:
    print(i + ': ' + str(score(svm.SVC(kernel=i).fit(X_train, y_train), X_test, y_test)) + '%')
In contrast to the first data set, it appears as though a linear kernel is more appropriate for this data set, so it will be used in all further calculations.
In [26]:
true_positives = []
false_negatives = []
true_negatives = []
false_positives = []
for i in [x for x in range(1, max(Y) + 1) if x in Y]:
    user = [1 if x == i else 0 for x in Y]
    X_train, X_test, Y_train, Y_test = train_test_split(X, user, test_size=0.2, random_state=42)
    if Y_test.count(1) == 0:
        # no genuine samples for this user landed in the test split; skip to avoid dividing by zero
        print(i)
        continue
    clf = svm.SVC(kernel='linear')
    clf.fit(X_train, Y_train)
    scores = score_complex(clf, X_test, Y_test)
    true_positives.append(scores[0])
    false_negatives.append(scores[3])
    true_negatives.append(scores[1])
    false_positives.append(scores[2])
    # print('SUBJECT' + str(i) + ': \n True-Positive = ' + str(scores[0]) + '% \n False-Negative = ' + str(scores[3]) + '% \n True-Negative = ' + str(scores[1]) + '% \n False-Positive = ' + str(scores[2]) + '%')
This is the GREYC dataset where users typed multiple passwords. It contains 118 users, each with a unique password. The dataset is complex in its organization and should be treated as a stretch goal: for each password it contains samples from the genuine user as well as impostor attempts at that same password. Everything here lives in keystroke3.tar.gz.
The dataset is here and the associated paper that uses this dataset can be found here.
It is not worth looking into this dataset until we have solid results on the other datasets.
In [ ]:
In [ ]:
import csv
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.cross_validation import train_test_split
In [ ]:
filename = "../data/DSL-StrongPasswordData.csv"
In [ ]:
specs = []
users = []
with open(filename, 'r') as f:
    reader = csv.reader(f)
    first = 1
    for row in reader:
        if first:
            first = 0
            continue
        specs.append(row[3:])
        users.append(int(row[0][1:]))
In [ ]:
users
In [ ]:
specs
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(specs, users, test_size=0.2, random_state=42)
In [ ]:
y_train
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(specs, users, test_size=0.2, random_state=42)
clf = svm.SVC()
clf.fit(X_train, y_train)
In [ ]:
idx = 0
score = 0
for x in X_test:
    y_pred = clf.predict([x])
    if y_pred == y_test[idx]:
        score += 1
    idx += 1
print(str(100*score/idx) + "%")
In [ ]:
print(score, idx)
In [ ]:
# Decision-surface visualization (adapted from the scikit-learn iris example; the
# original axis labels were left over from it). The feature space here has many
# dimensions, so to draw a boundary we refit an SVC on just the first two timing features.
np_specs = np.array(X_train).astype(float)[:, :2]
np_users = np.array(y_train)
print(np_specs)
print(np_users)
clf2d = svm.SVC().fit(np_specs, np_users)
h = 0.02  # mesh step size
x_min, x_max = np_specs[:, 0].min() - 0.1, np_specs[:, 0].max() + 0.1
y_min, y_max = np_specs[:, 1].min() - 0.1, np_specs[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf2d.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
# Plot also the training points
plt.scatter(np_specs[:, 0], np_specs[:, 1], c=np_users, cmap=plt.cm.Paired)
plt.xlabel('First timing feature')
plt.ylabel('Second timing feature')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
In [ ]:
specs[:3]